Selective inference for k-means clustering

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal to hand-written digits data and to single-cell RNA-sequencing data.


Introduction
Testing for a difference in means between two groups is one of the most fundamental tasks in statistics, with numerous applications. If the groups under investigation are pre-specified, i.e., not a function of the observed data, then classical hypothesis tests will control the Type I error rate. However, it is increasingly common to want to test for a difference in means between groups that are defined through the observed data, e.g., via the output of a clustering algorithm. For instance, in single-cell RNA-sequencing analysis, researchers often first cluster the cells, and then test for a difference in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types (Grün et al., 2015; Aizarani et al., 2019; Lähnemann et al., 2020; Zhang et al., 2019; Doughty & Kerkhoven, 2020). In fact, the inferential challenges resulting from testing data-guided hypotheses have been described as a "grand challenge" in the field of genomics (Lähnemann et al., 2020), and papers in the field continue to overlook this issue: as an example, seurat (Stuart et al., 2019), the state-of-the-art single-cell RNA-sequencing analysis tool, tests for differential gene expression between groups obtained via clustering, with a note that "p-values [from these hypotheses] should be interpreted cautiously, as the genes used for clustering are the same genes tested for differential expression." Testing data-guided hypotheses also arises in neuroscience (Kriegeskorte et al., 2009; Button, 2019), social psychology (Hung & Fithian, 2020), and the physical sciences (Friederich et al., 2020; Pollice et al., 2021). When the null hypothesis is a function of the data, classical tests that do not account for this will fail to control the Type I error.
In this paper, we develop a test for a difference in means between two clusters estimated from applying k-means clustering (Lloyd, 1982; MacQueen et al., 1967), an extremely popular clustering algorithm with numerous applications (Xu & Wunsch, 2008; Wang & Lee, 2008; Steinley, 2006; Hand & Adams, 2015). We consider the following simple and well-studied model (Gao et al., 2020; Löffler et al., 2021; Lu & Zhou, 2016) for n observations and q features:
$$X \sim \mathcal{MN}_{n \times q}(\mu, I_n, \sigma^2 I_q), \qquad (1)$$
where $\mu \in \mathbb{R}^{n \times q}$ has unknown rows $\mu_i$, and $\sigma^2 > 0$ is known. Given a realization $x \in \mathbb{R}^{n \times q}$ of $X$, we first apply the k-means clustering algorithm to obtain $\mathcal{C}(x)$, a partition of the samples $\{1, \ldots, n\}$. We might then consider testing the null hypothesis that the mean is the same across two estimated clusters, i.e.,
$$H_0: \bar{\mu}_{\hat{C}_1} = \bar{\mu}_{\hat{C}_2} \quad \text{versus} \quad H_1: \bar{\mu}_{\hat{C}_1} \neq \bar{\mu}_{\hat{C}_2}, \qquad (2)$$
where $\hat{C}_1, \hat{C}_2 \in \mathcal{C}(x)$ are estimated clusters with cardinality $|\hat{C}_1|$ and $|\hat{C}_2|$, and $\bar{\mu}_{\hat{C}_k} = \sum_{i \in \hat{C}_k} \mu_i / |\hat{C}_k|$. This is equivalent to testing $H_0: \mu^\top \nu = 0_q$ versus $H_1: \mu^\top \nu \neq 0_q$, where
$$\nu_i = \frac{1\{i \in \hat{C}_1\}}{|\hat{C}_1|} - \frac{1\{i \in \hat{C}_2\}}{|\hat{C}_2|}, \quad i = 1, \ldots, n, \qquad (3)$$
and $1\{A\}$ equals 1 if the event $A$ holds, and 0 otherwise. At first glance, we can test the hypothesis in (2) by applying a Wald test, with p-value given by
$$p_{\text{Naive}} = \text{pr}_{H_0}\left(\|X^\top \nu\|_2 \geq \|x^\top \nu\|_2\right), \qquad (4)$$
where $\|X^\top \nu\|_2 \sim (\sigma \|\nu\|_2)\chi_q$ under $H_0$. But this "naive" approach ignores the fact that $H_0$ is chosen based on the data, i.e., we constructed the contrast vector in (3) because $\hat{C}_1$ and $\hat{C}_2$ were obtained by clustering. Therefore, we will observe substantial differences between the cluster centroids $\sum_{i \in \hat{C}_1} x_i / |\hat{C}_1|$ and $\sum_{i \in \hat{C}_2} x_i / |\hat{C}_2|$, even in the absence of true differences in their population means (left panel of Figure 1). The center panel of Figure 1 illustrates that the test based on (4) does not control the selective Type I error: that is, the probability of falsely rejecting a null hypothesis, given that we decided to test it (Fithian et al., 2014).
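To make the naive approach concrete, the following is a minimal sketch of (3) and (4) in Python, using scikit-learn's KMeans for the clustering; the function name naive_p_value is our own illustration, not part of any package:

```python
import numpy as np
from scipy.stats import chi
from sklearn.cluster import KMeans

def naive_p_value(x, sigma, labels, k1, k2):
    """Wald-type p-value (4), ignoring that clusters k1, k2 were estimated from x."""
    n, q = x.shape
    # Contrast vector nu in (3): +1/|C1| on cluster k1, -1/|C2| on cluster k2.
    nu = np.where(labels == k1, 1.0 / np.sum(labels == k1), 0.0) \
       - np.where(labels == k2, 1.0 / np.sum(labels == k2), 0.0)
    stat = np.linalg.norm(x.T @ nu)        # ||x^T nu||_2
    scale = sigma * np.linalg.norm(nu)     # under H_0, ||X^T nu||_2 ~ scale * chi_q
    return chi.sf(stat / scale, df=q)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))              # model (1) with mu = 0 and sigma = 1
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(x)
print(naive_p_value(x, sigma=1.0, labels=labels, k1=0, k2=1))
```

Repeating this over many null datasets reproduces the anti-conservative behavior shown in the center panel of Figure 1.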
Notably, the problem of testing for a difference in means between two groups obtained via clustering cannot be easily overcome by sample splitting, as pointed out in Gao et al. (2020) and Zhang et al. (2019). To see why, we divide the observations into a training and a test set. We apply k-means clustering on only the training set (left panel of Figure 2), and then assign the test set observations to those clusters (to obtain the center panel of Figure 2, we applied a 3-nearest-neighbor classifier). Finally, we compute the naive p-values (4) only on the test set. Unfortunately, this approach does not work: while we clustered only the training data, we still used the test data to label the test observations, and consequently to construct the contrast vector $\nu$ in (3). Therefore, the Wald test based on sample splitting remains extremely anti-conservative, as shown in the right panel of Figure 2, and does not lead to a valid test of $H_0$ in (2).

Fig. 1: Left: One simulated dataset generated from (1) with $\mu = 0_{100 \times 2}$ and $\sigma = 1$. We apply k-means clustering to obtain three clusters. The cluster centroids are displayed as triangles. Center: Quantile-quantile plot of the naive p-values (defined in (4)) applied to 2,000 simulated datasets from (1) with $\mu = 0_{100 \times 2}$ and $\sigma = 1$. Right: Quantile-quantile plot of our proposed p-values (defined in (9)) applied to the same simulated datasets.

In this paper, we develop a test of $H_0$ that controls the selective Type I error. That is, we wish to ensure that the probability of rejecting $H_0$ at level $\alpha$, given that $H_0$ holds and we decided to test it, is no greater than $\alpha$:
$$\text{pr}_{H_0}\left(\text{reject } H_0 \text{ at level } \alpha \mid H_0 \text{ is tested}\right) \leq \alpha. \qquad (5)$$
To develop the test, we leverage the selective inference framework, which has been applied extensively in high-dimensional linear modeling (Lee et al., 2016; Tibshirani et al., 2016; Fithian et al., 2014; Rügamer et al., 2022; Schultheiss et al., 2021; Taylor & Tibshirani, 2018; Charkhi & Claeskens, 2018; Yang et al., 2016; Loftus & Taylor, 2014), changepoint detection (Jewell et al., 2022; Hyun et al., 2021, 2018; Chen et al., 2021b; Le Duy & Takeuchi, 2021; Duy et al., 2020; Benjamini et al., 2019), and clustering (Zhang et al., 2019; Gao et al., 2020; Watanabe & Suzuki, 2021). The key insight behind selective inference is as follows: to obtain a valid test of $H_0$, we need to condition on the aspect of the data that led us to test it. In our case, we chose to test the null hypothesis in (2) because $\hat{C}_1$ and $\hat{C}_2$ were obtained via k-means clustering. Therefore, we compute a p-value conditional on the event that k-means clustering yields $\hat{C}_1$ and $\hat{C}_2$. This results in selective Type I error control (5), as seen in the right panel of Figure 1.
There is a rich literature on estimating and quantifying the uncertainty in the number of clusters (Li & Chen, 2010; Chen & Li, 2009; Chen et al., 2004; McLachlan et al., 2019; Dobriban, 2020), as well as assessing cluster stability and heterogeneity (Suzuki & Shimodaira, 2006; Kerr & Churchill, 2001; Kimes et al., 2017; Chung, 2020; Jin & Wang, 2016; Aw et al., 2021; Chung & Storey, 2015). Others have examined the asymptotic properties of clustering models from a Bayesian perspective (Guha et al., 2019; Nobile, 2004; Cai et al., 2020). In addition, k-means clustering is a special case of the expectation-maximization algorithm, which allows us to tap into the active line of research on the statistical guarantees of the expectation-maximization algorithm (Zhang & Zhang, 2014; Wang et al., 2015; Cai et al., 2019; Yi & Caramanis, 2015; Balakrishnan et al., 2017). However, most prior work focused on scenarios where the number of clusters is correctly specified, and the estimated cluster memberships are close to the truth. By contrast, we are interested in a correctly-sized test of the null hypothesis (2), even when $\hat{C}_1, \hat{C}_2$ do not correspond to true clusters. In addition, existing work often relies on asymptotic approximations and bootstrap resampling. Two recent exceptions include Zhang et al. (2019) and Gao et al. (2020), who took a selective inference approach and computed finite-sample p-values for testing the difference in means between estimated clusters obtained via linear classification rules and hierarchical clustering, respectively. Our work is closest to Gao et al. (2020), and extends their framework to k-means clustering. We provide an exact, finite-sample test of the difference in means between a pair of clusters estimated via k-means clustering under model (1), without the need for sample splitting.
Methods developed in this paper are implemented in the R package KmeansInference, available at https://github.com/yiqunchen/KmeansInference. Data and code for reproducing the results in this paper can be found at https://github.com/yiqunchen/KmeansInference-experiments.
Throughout this paper, we will use the following notational conventions. For a matrix $A$, $A_i$ denotes its $i$th row and $A_{ij}$ denotes its $(i,j)$th entry. For a vector $\nu \in \mathbb{R}^n$, $\|\nu\|_2$ denotes its $\ell_2$ norm, and $\Pi_\nu^\perp$ is the projection matrix onto the orthogonal complement of $\nu$, i.e., $\Pi_\nu^\perp = I_n - \nu\nu^\top / \|\nu\|_2^2$, where $I_n$ is the $n$-dimensional identity matrix. Moreover, $\text{dir}(\nu) = \nu / \|\nu\|_2$ if $\nu \neq 0_n$ and $0_n$ otherwise, where $0_n$ is the $n$-vector of zeros. We let $\langle \cdot, \cdot \rangle$ and $1\{\cdot\}$ denote the inner product of two vectors and the indicator function, respectively.
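For concreteness, a minimal numerical sketch of these two conventions (in Python; the function names are ours):

```python
import numpy as np

def proj_orth_complement(nu):
    """Pi_nu^perp = I_n - nu nu^T / ||nu||_2^2."""
    return np.eye(len(nu)) - np.outer(nu, nu) / (nu @ nu)

def direction(nu):
    """dir(nu) = nu / ||nu||_2, with dir(0_n) = 0_n."""
    norm = np.linalg.norm(nu)
    return nu / norm if norm > 0 else nu
```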

Selective inference for k-means clustering
2.1. A brief review of k-means clustering

In this section, we review the k-means clustering algorithm. Given samples $x_1, \ldots, x_n \in \mathbb{R}^q$ and a positive integer $K$, k-means clustering partitions the $n$ samples into disjoint subsets $\hat{C}_1, \ldots, \hat{C}_K$ by solving the following optimization problem:
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \; \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| x_i - \frac{1}{|C_k|} \sum_{i' \in C_k} x_{i'} \Big\|_2^2. \qquad (6)$$
It is not typically possible to solve for the global optimum in (6) (Aloise et al., 2009). A number of algorithms are available to find a local optimum (Hartigan & Wong, 1979; Zha et al., 2002; Arthur & Vassilvitskii, 2007); one such approach is Lloyd's algorithm (Lloyd, 1982), given in Algorithm 1. We first sample K out of n observations as initial centroids (step 1 in Algorithm 1). We then assign each observation to the closest centroid (step 2). Next, we iterate between re-computing the centroids and updating the cluster assignments (steps 3a and 3b) until the cluster assignments stop changing. The algorithm is guaranteed to converge to a local optimum (Hastie et al., 2001).
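The following is a minimal sketch of Algorithm 1 as just described (random initialization, assignment, and centroid updates); it assumes no cluster becomes empty during the iterations, and is an illustration rather than the implementation in the KmeansInference package:

```python
import numpy as np

def lloyd_kmeans(x, K, rng, max_iter=100):
    """Minimal version of Algorithm 1; assumes no cluster becomes empty."""
    n, q = x.shape
    # Step 1: sample K observations as initial centroids.
    centroids = x[rng.choice(n, size=K, replace=False)]
    # Step 2: assign each observation to the closest centroid.
    labels = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for _ in range(max_iter):
        # Step 3a: recompute each centroid as the mean of its cluster.
        centroids = np.stack([x[labels == k].mean(axis=0) for k in range(K)])
        # Step 3b: update the cluster assignments.
        new_labels = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```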
In what follows, we will sometimes use $c_i^{(t)}(x)$ and $m_k^{(t)}(x)$ to emphasize the dependence of the cluster labels and centroids on the data $x$.

A test of (2) for clusters obtained via k-means clustering
Here, we briefly review the proposal of Gao et al. (2020) for selective inference for hierarchical clustering, and outline a selective test of (2) for k-means clustering. Gao et al. (2020) proposed a selective inference framework for testing hypotheses based on the output of a clustering algorithm. Let $\mathcal{C}(\cdot)$ denote the clustering operator, i.e., a partition of the observations resulting from a clustering algorithm. Since $H_0$ in (2) is chosen because $\hat{C}_1, \hat{C}_2 \in \mathcal{C}(x)$, where $\hat{C}_1, \hat{C}_2$ are the two estimated clusters under consideration in (2), Gao et al. (2020) proposed to reject $H_0$ if the p-value
$$\text{pr}_{H_0}\left(\|X^\top \nu\|_2 \geq \|x^\top \nu\|_2 \,\middle|\, \hat{C}_1, \hat{C}_2 \in \mathcal{C}(X), \; \Pi_\nu^\perp X = \Pi_\nu^\perp x, \; \text{dir}(X^\top \nu) = \text{dir}(x^\top \nu)\right) \qquad (7)$$
is small. In (7), conditioning on $\Pi_\nu^\perp X = \Pi_\nu^\perp x$ and $\text{dir}(X^\top \nu) = \text{dir}(x^\top \nu)$ eliminates the nuisance parameters $\Pi_\nu^\perp \mu$ and $\text{dir}(\mu^\top \nu)$, where $\Pi_\nu^\perp = I_n - \nu\nu^\top / \|\nu\|_2^2$ and $\text{dir}(\mu^\top \nu) = \mu^\top \nu / \|\mu^\top \nu\|_2$ (see, e.g., Section 3.1 of Fithian et al. (2014)). Gao et al. (2020) showed that the test that rejects $H_0$ when (7) is below $\alpha$ controls the selective Type I error at level $\alpha$, in the sense of (5). Furthermore, under (1), the conditional distribution of $\|X^\top \nu\|_2$ in (7) is a $(\sigma\|\nu\|_2)\chi_q$ distribution truncated to a set. When the operator $\mathcal{C}(\cdot)$ denotes hierarchical clustering, this set can be analytically characterized and efficiently computed, leading to an efficient algorithm for computing (7).
We now extend these ideas to k-means clustering (6). Since the k-means algorithm partitions all n observations, it is natural to condition on the cluster assignments of all observations rather than just on $\hat{C}_1, \hat{C}_2 \in \mathcal{C}(X)$. This leads to the p-value
$$\text{pr}_{H_0}\left(\|X^\top \nu\|_2 \geq \|x^\top \nu\|_2 \,\middle|\, \bigcap_{i=1}^{n} \left\{c_i^{(T)}(X) = c_i^{(T)}(x)\right\}, \; \Pi_\nu^\perp X = \Pi_\nu^\perp x, \; \text{dir}(X^\top \nu) = \text{dir}(x^\top \nu)\right), \qquad (8)$$
where $c_i^{(T)}(X)$ is the cluster assigned to the $i$th observation at the final iteration of Algorithm 1. However, computing (8) requires characterizing the set of datasets that yield the final cluster assignments, which is not straightforward, and may necessitate enumerating over a possibly exponential number of intermediate cluster assignments $c_i^{(t)}(\cdot)$ for $t = 1, \ldots, T-1$. Hence, we also condition on all of the intermediate clustering assignments in Algorithm 1:
$$p_{\text{selective}} = \text{pr}_{H_0}\left(\|X^\top \nu\|_2 \geq \|x^\top \nu\|_2 \,\middle|\, \bigcap_{t=0}^{T} \bigcap_{i=1}^{n} \left\{c_i^{(t)}(X) = c_i^{(t)}(x)\right\}, \; \Pi_\nu^\perp X = \Pi_\nu^\perp x, \; \text{dir}(X^\top \nu) = \text{dir}(x^\top \nu)\right). \qquad (9)$$
In (9), $c_i^{(t)}(X)$ is the cluster assigned to the $i$th observation at the $t$th iteration of Algorithm 1. Roughly speaking, this p-value answers the question: Assuming that there is no difference between the population means of $\hat{C}_1$ and $\hat{C}_2$, what is the probability of observing such a large difference between their centroids, among all the realizations of $X$ that yield identical results in every iteration of the k-means algorithm?
The p-value in (9) is the focus of this paper. We establish its key properties below.
Proposition 1 states that $p_{\text{selective}}$ can be recast as the survival function of a scaled $\chi_q$ random variable, truncated to the set
$$S_T = \left\{\phi \geq 0 : c_i^{(t)}(x'(\phi)) = c_i^{(t)}(x) \text{ for all } i = 1, \ldots, n, \; t = 0, \ldots, T\right\}, \qquad (12)$$
where the perturbed data matrix $x'(\phi)$ is defined as
$$x'(\phi) = \Pi_\nu^\perp x + \phi \, \frac{\nu \, \text{dir}(x^\top \nu)^\top}{\|\nu\|_2^2}. \qquad (11)$$
Therefore, to compute $p_{\text{selective}}$, it suffices to characterize the set $S_T$. In (11), $x'(\phi)$ results from applying a perturbation to the observed data $x$ along the direction of $x^\top \nu$, the difference between the two cluster centroids of interest. Figure 3 illustrates a realization of (1) for k-means clustering with $K = 3$. The left panel displays the observed data $x$, which corresponds to $x'(\phi)$ with $\phi = \|x^\top \nu\|_2 = 4.37$. Here, $\nu$ defined in (3) was chosen to test the difference between $\hat{C}_1$ (shown in pink) and $\hat{C}_2$ (shown in blue). The center and right panels of Figure 3 display $x'(\phi)$ with $\phi = 0$ and $\phi = 6$, respectively. In the center panel, with $\phi = 0$, the blue and pink clusters are "pushed together", resulting in $\|x'(\phi)^\top \nu\|_2 = 0$; that is, there is no difference in empirical means between the two clusters under consideration. Applying k-means clustering no longer results in these clusters. By contrast, in the right panel, with $\phi = 6$, the blue and pink clusters are "pulled apart" along the direction of $x^\top \nu$, which results in an increased distance between the centroids of the blue and pink clusters, and k-means clustering does yield the same clusters as on the original data. In this example, $S_T = (3.59, \infty)$.
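To illustrate, here is a sketch of the perturbation in (11) and of the truncated survival function in Proposition 1, for the simple case in which $S_T$ is a single interval $(a, \infty)$ as in this example (in general, $S_T$ may be a union of intervals; the function names are ours):

```python
import numpy as np
from scipy.stats import chi

def x_prime(x, nu, phi):
    """Perturbed data x'(phi) in (11): remove the component of x along nu,
    then re-insert it with magnitude phi along dir(x^T nu)."""
    nu = nu.reshape(-1, 1)                          # n x 1
    v = x.T @ nu                                    # x^T nu, q x 1
    dir_v = v / np.linalg.norm(v)                   # dir(x^T nu)
    pi_perp_x = x - nu @ (nu.T @ x) / (nu.T @ nu)   # Pi_nu^perp x
    return pi_perp_x + phi * (nu @ dir_v.T) / (nu.T @ nu)

def p_selective_interval(phi_obs, scale, q, a):
    """Survival function of a scale * chi_q variable truncated to (a, inf)."""
    return chi.sf(phi_obs / scale, df=q) / chi.sf(a / scale, df=q)

# In the example of Figure 3: phi_obs = 4.37 and S_T = (3.59, inf),
# with scale = sigma * ||nu||_2 under model (1).
```

Note that $x'(\|x^\top \nu\|_2) = x$, so the observed data sits at $\phi = \|x^\top \nu\|_2$ inside $S_T$.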

Computation of the selective p-value
In Section 2, we showed that the p-value $p_{\text{selective}}$ in (9) involves the set $S_T$ in (12). Here, we start with a high-level summary of our approach to characterizing $S_T$ in (12). First, we rewrite $S_T$ as the intersection of the event that the initial cluster assignments based on $x'(\phi)$ and $x$ agree, and the events that the cluster assignments at each subsequent iteration agree. Next, we consider the first term in the intersection: according to step 2 of Algorithm 1, $c_i^{(0)}(x'(\phi)) = c_i^{(0)}(x)$ for $i = 1, \ldots, n$ if and only if, for $i = 1, \ldots, n$, the initial randomly-sampled centroid to which $[x'(\phi)]_i$ is closest coincides with the initial centroid to which $x_i$ is closest. This condition can be expressed using $K - 1$ inequalities. Furthermore, the same intuition holds for the second term in the intersection, except that the centroids are a function of the cluster assignments in the previous iteration. We formalize this intuition in Proposition 2.
Proposition 2. Suppose that we apply the k-means clustering algorithm (Algorithm 1) to a matrix $x \in \mathbb{R}^{n \times q}$, to obtain $K$ clusters in at most $T$ steps. Define $w^{(t)}(k) \in \mathbb{R}^n$, the weight vector corresponding to cluster $k$ at iteration $t$, with entries
$$w_i^{(t)}(k) = \frac{1\{c_i^{(t)}(x) = k\}}{\sum_{j=1}^{n} 1\{c_j^{(t)}(x) = k\}}, \qquad (13)$$
so that $x^\top w^{(t)}(k)$ is the centroid of the $k$th cluster computed from the assignments at the $t$th iteration. Then, for the set $S_T$ defined in (12), we have that
$$S_T = \bigcap_{i=1}^{n} \bigcap_{k=1}^{K} \left\{\phi \geq 0 : \left\|[x'(\phi)]_i - m_{c_i^{(0)}(x)}^{(0)}(x'(\phi))\right\|_2^2 \leq \left\|[x'(\phi)]_i - m_k^{(0)}(x'(\phi))\right\|_2^2\right\} \qquad (14)$$
$$\cap \; \bigcap_{t=1}^{T} \bigcap_{i=1}^{n} \bigcap_{k=1}^{K} \left\{\phi \geq 0 : \left\|[x'(\phi)]_i - (x'(\phi))^\top w^{(t-1)}(c_i^{(t)}(x))\right\|_2^2 \leq \left\|[x'(\phi)]_i - (x'(\phi))^\top w^{(t-1)}(k)\right\|_2^2\right\}. \qquad (15)$$
Recall that $c_i^{(t)}(x)$ denotes the cluster to which the $i$th observation is assigned in step 3b of Algorithm 1 during the $t$th iteration, and that $m_k^{(0)}(x)$ denotes the $k$th centroid sampled from the data $x$ during step 1 of Algorithm 1. In words, Proposition 2 says that $S_T$ can be expressed as the intersection of $O(nKT)$ sets. Therefore, it suffices to characterize the sets in (14) and (15). We now present two lemmas.
Lemma 1 (Lemma 2 in Gao et al. (2020)). For $\nu$ in (3) and $x'(\phi)$ in (11), the inner product $\langle [x'(\phi)]_i, [x'(\phi)]_j \rangle$ is a quadratic function of $\phi$ whose coefficients can be computed analytically from $x$ and $\nu$.

Lemma 2 extends this result to the inner products involving the weighted averages $(x'(\phi))^\top w^{(t)}(k)$, with $w^{(t)}(k)$ defined in (13). It follows from Lemmas 1 and 2 that all of the inequalities in (14) and (15) are in fact quadratic in $\phi$, with coefficients that can be analytically computed. Therefore, computing the set $S_T$ requires solving $O(nKT)$ quadratic inequalities in $\phi$.
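Each inequality in (14) and (15) therefore reduces to a condition of the form $a\phi^2 + b\phi + c \leq 0$, and $S_T$ is the intersection of the corresponding solution sets. A minimal sketch of this step follows (in Python; the function name is ours, and the computation of the coefficients from Lemmas 1 and 2 is omitted):

```python
import numpy as np

def quad_leq_zero(a, b, c, tol=1e-12):
    """Intervals of phi >= 0 on which a*phi^2 + b*phi + c <= 0."""
    if abs(a) < tol:                             # degenerate: linear in phi
        if abs(b) < tol:
            return [(0.0, np.inf)] if c <= tol else []
        r = -c / b
        if b > 0:
            return [(0.0, r)] if r >= 0.0 else []
        return [(max(r, 0.0), np.inf)]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:                               # no real roots: sign of a decides
        return [(0.0, np.inf)] if a < 0 else []
    r1, r2 = sorted([(-b - disc ** 0.5) / (2 * a), (-b + disc ** 0.5) / (2 * a)])
    if a > 0:                                    # satisfied between the roots
        return [(max(r1, 0.0), r2)] if r2 >= 0.0 else []
    out = []                                     # a < 0: satisfied outside the roots
    if r1 >= 0.0:
        out.append((0.0, r1))
    out.append((max(r2, 0.0), np.inf))
    return out

# S_T is then the intersection, over all O(nKT) inequalities, of such interval unions.
```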
Proposition 3. Suppose that we apply the k-means clustering algorithm (Algorithm 1) to a matrix $x \in \mathbb{R}^{n \times q}$, to obtain $K$ clusters in at most $T$ steps. Then, the set $S_T$ defined in (12) can be computed in $O(nKT(n + q) + nKT \log(nKT))$ operations.

Non-spherical covariance matrix
Thus far, we have assumed that the observed data $x$ is a realization of (1), which implies that $\text{cov}(X_i) = \sigma^2 I_q$. However, this assumption is often violated in practice. For example, expression levels of genes are highly correlated, and neighbouring pixels in an image tend to be more similar. For a known positive definite matrix $\Sigma$, we now let
$$X \sim \mathcal{MN}_{n \times q}(\mu, I_n, \Sigma). \qquad (16)$$
Under (16), we can whiten the data by applying the transformation $x_i \mapsto \Sigma^{-\frac{1}{2}} x_i$ (Bell & Sejnowski, 1997), where $\Sigma^{-\frac{1}{2}}$ is the unique symmetric positive definite square root of $\Sigma^{-1}$ (Horn & Johnson, 2012). Note that $\Sigma^{-\frac{1}{2}} X_i \sim N(\Sigma^{-\frac{1}{2}} \mu_i, I_q)$. Moreover, as $\Sigma^{-\frac{1}{2}} \succ 0$, testing the null hypothesis in (2) is equivalent to testing $H_0: \Sigma^{-\frac{1}{2}} \mu^\top \nu = 0_q$ versus $H_1: \Sigma^{-\frac{1}{2}} \mu^\top \nu \neq 0_q$. Therefore, to get a correctly-sized test under model (16), we can simply carry out our proposal in Section 2 on the transformed data $\Sigma^{-\frac{1}{2}} x_i$ instead of the original data $x_i$. Instead of applying the whitening transformation, we can directly accommodate a known covariance matrix $\Sigma$ by considering the following extension of $p_{\text{selective}}$ in (9): the p-value $p_{\Sigma,\text{selective}}$ (18), obtained by replacing $\|X^\top \nu\|_2$ and $\|x^\top \nu\|_2$ in (9) with $\|\Sigma^{-\frac{1}{2}} X^\top \nu\|_2$ and $\|\Sigma^{-\frac{1}{2}} x^\top \nu\|_2$, respectively, and conditioning on the analogous sufficient statistics.

Proposition 4. Suppose that $x$ is a realization from (16), and let $\phi \sim (\|\nu\|_2)\chi_q$. Then, under $H_0: \mu^\top \nu = 0_q$ with $\nu$ defined in (3), we have that $p_{\Sigma,\text{selective}} = \text{pr}\left(\phi \geq \|\Sigma^{-\frac{1}{2}} x^\top \nu\|_2 \,\middle|\, \phi \in S_T^\Sigma\right)$, where $p_{\Sigma,\text{selective}}$ is defined in (18) and $S_T^\Sigma$ is the analogue of $S_T$ in (12). Furthermore, the test that rejects $H_0: \mu^\top \nu = 0_q$ when $p_{\Sigma,\text{selective}} \leq \alpha$ controls the selective Type I error at level $\alpha$.
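A sketch of the whitening step under (16), computing the symmetric square root of $\Sigma^{-1}$ from an eigendecomposition (function name ours):

```python
import numpy as np

def whiten_rows(x, Sigma):
    """Map each row x_i to Sigma^{-1/2} x_i, with Sigma^{-1/2} the symmetric
    positive definite square root of Sigma^{-1}."""
    evals, U = np.linalg.eigh(Sigma)             # Sigma = U diag(evals) U^T
    Sigma_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T
    # Since Sigma^{-1/2} is symmetric, right-multiplication transforms each row.
    return x @ Sigma_inv_sqrt
```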
In addition, we can adapt the results in Section 3 to compute the set $S_T^\Sigma$ by modifying the results in Lemmas 1 and 2. Details are in Section A.5 of the Appendix.

Unknown variance
When $\sigma$ is unknown, we can plug an estimate $\hat\sigma$ into (9):
$$\hat{p}_{\text{selective}}(\hat\sigma) = \text{pr}\left(\phi(\hat\sigma) \geq \|x^\top \nu\|_2 \,\middle|\, \phi(\hat\sigma) \in S_T\right), \qquad (20)$$
where $\phi(\hat\sigma) \sim (\hat\sigma \|\nu\|_2)\chi_q$. If we use a consistent estimator of $\sigma$, then a test based on the p-value in (20) provides selective Type I error control (5) asymptotically.

Proposition 5. For $q = 1, 2, \ldots$, suppose that $X^{(q)} \sim \mathcal{MN}_{n \times q}(\mu^{(q)}, I_n, \sigma^2 I_q)$. Let $x^{(q)}$ be a realization from $X^{(q)}$ and let $c_i^{(t)}(\cdot)$ be the cluster to which the $i$th observation is assigned during the $t$th iteration of step 3b in Algorithm 1. Consider the sequence of null hypotheses $H_0^{(q)}: (\mu^{(q)})^\top \nu^{(q)} = 0_q$, where $\nu^{(q)}$ defined in (3) is the contrast vector resulting from applying k-means clustering to $x^{(q)}$. Suppose that (i) $\hat\sigma$ is a consistent estimator of $\sigma$, i.e., for all $\epsilon > 0$, $\lim_{q \to \infty} \text{pr}\left(|\hat\sigma(X^{(q)}) - \sigma| \geq \epsilon\right) = 0$; and (ii) a regularity condition holds for some $\delta \in (0, 1)$. Then, for all $\alpha \in (0, 1)$, we have that
$$\lim_{q \to \infty} \text{pr}_{H_0^{(q)}}\left(\hat{p}_{\text{selective}}(\hat\sigma) \leq \alpha\right) \leq \alpha.$$
In practice, we propose to use the following estimator of $\sigma$ (Huber, 1981):
$$\hat\sigma_{\text{MED}} = \left\{\underset{1 \leq i \leq n, \, 1 \leq j \leq q}{\mathrm{median}} \; \tilde{x}_{ij}^2 \, \Big/ \, M_{\chi^2_1}\right\}^{1/2}, \qquad (21)$$
where $\tilde{x}$ is obtained by subtracting the median of each column of $x$, and $M_{\chi^2_1}$ is the median of the $\chi^2_1$ distribution. If $\mu$ is sparse, i.e., $\sum_{i=1}^{n} \sum_{j=1}^{q} 1\{\mu_{ij} \neq 0\}$ is small, then (21) is consistent under appropriate assumptions; see Appendix A.7.
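A sketch of the estimator in (21) as reconstructed above (column medians removed, then the median squared entry rescaled by the median of the $\chi^2_1$ distribution; function name ours):

```python
import numpy as np
from scipy.stats import chi2

def sigma_med(x):
    """Median-based estimate of sigma, following (21)."""
    x_tilde = x - np.median(x, axis=0)           # subtract each column's median
    return np.sqrt(np.median(x_tilde ** 2) / chi2.median(1))
```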

Simulation study
5.1. Overview

Throughout this section, we consider testing the null hypothesis $H_0: \mu^\top \nu = 0_q$ versus $H_1: \mu^\top \nu \neq 0_q$, where, unless otherwise stated, $\nu$ defined in (3) is based on a randomly-chosen pair of clusters $\hat{C}_1$ and $\hat{C}_2$ from k-means clustering. We consider four p-values: $p_{\text{Naive}}$ in (4), $p_{\text{selective}}$ in (9), $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ in (20) with $\hat\sigma_{\text{MED}}$ defined in (21), and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ in (20) with $\hat\sigma_{\text{Sample}} = \left\{\sum_{i=1}^{n} \sum_{j=1}^{q} (x_{ij} - \bar{x}_j)^2 / (nq - q)\right\}^{1/2}$, where $\bar{x}_j = \sum_{i=1}^{n} x_{ij} / n$. In the simulations that follow, we compare the selective Type I error (5) and power of the tests that reject $H_0$ when these p-values are less than $\alpha = 0.05$.
For each simulated dataset, we apply k-means clustering with $K = 3$, and then compute $p_{\text{Naive}}$, $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ for a randomly-chosen pair of clusters. Figure 4 displays the observed p-value quantiles versus the Uniform(0,1) quantiles. We see that for all values of $q$: (i) the naive p-values in (4) are stochastically smaller than a Uniform(0,1) random variable, and the test based on $p_{\text{Naive}}$ leads to an inflated Type I error rate; (ii) tests based on $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ control the selective Type I error rate in the sense of (5).

Fig. 4: Quantile-quantile plots for $p_{\text{Naive}}$ (pink), $p_{\text{selective}}$ (green), $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ (orange), and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ (purple) under (1) with $\mu = 0_{n \times q}$, stratified by $q$.

Conditional power and detection probability
In this section, we show that the tests based on our proposal ($p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$) have substantial power to reject $H_0$ when it does not hold. We generate data from (1) with $n = 150$ and a mean matrix $\mu$ specified in (22). Here, we can think of $C_1 = \{1, \ldots, n/3\}$, $C_2 = \{(n/3) + 1, \ldots, 2n/3\}$, and $C_3 = \{(2n/3) + 1, \ldots, n\}$ as the "true clusters". Moreover, these clusters are equidistant in the sense that the pairwise distance between each pair of population means is $|\delta|$. Recall that we test $H_0$ in (2) for a pair of estimated clusters $\hat{C}_1$ and $\hat{C}_2$, which may not be true clusters.
Hence, we will separately consider the conditional power and the detection probability of our proposed tests (Gao et al., 2020; Jewell et al., 2022; Hyun et al., 2021). The conditional power is the probability of rejecting $H_0$ in (2), given that $\hat{C}_1$ and $\hat{C}_2$ are true clusters.
Given $M$ simulated datasets, we estimate the conditional power as
$$\frac{\sum_{m=1}^{M} 1\{\hat{C}_1^{(m)}, \hat{C}_2^{(m)} \text{ are true clusters and } H_0 \text{ is rejected}\}}{\sum_{m=1}^{M} 1\{\hat{C}_1^{(m)}, \hat{C}_2^{(m)} \text{ are true clusters}\}}, \qquad (23)$$
and the detection probability as
$$\frac{1}{M} \sum_{m=1}^{M} 1\{\hat{C}_1^{(m)}, \hat{C}_2^{(m)} \text{ are true clusters}\}. \qquad (24)$$
We generate $M = 200{,}000$ datasets from (22) with $q = 10$, $\sigma = 0.25, 0.5, 1$, and $\delta = 2, 3, \ldots, 10$. For each simulated dataset, we apply k-means clustering with $K = 3$ and reject $H_0: \mu^\top \nu = 0_q$ if $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, or $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ is less than $\alpha = 0.05$. In Figure 5, the left panel displays the detection probability (24) of k-means clustering as a function of $\delta$ in (22), and the right panel displays the conditional power (23) for the tests based on $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$. Under model (1), the detection probability and conditional power increase as a function of $\delta$ in (22) for all values of $\sigma$. For a given value of $\delta$, a larger value of $\sigma$ leads to lower detection probability and conditional power. The conditional power is not displayed for $\delta = 2, 3$ and $\sigma = 1$ because the true clusters were never recovered in simulation. Moreover, for a given value of $\delta$ and $\sigma$, the test based on $p_{\text{selective}}$ has the highest conditional power, followed closely by the test based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$. Using $\hat\sigma_{\text{Sample}}$ in $\hat{p}_{\text{selective}}$ leads to a less powerful test, especially for large values of $\delta$. This is because $\hat\sigma_{\text{Sample}}$ is a conservative estimator of $\sigma$ in (1), and its bias is an increasing function of $\delta$, the distance between true clusters. By contrast, $\hat\sigma_{\text{MED}}$ is a consistent estimator under model (22) (see Appendix A.7).
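A sketch of the Monte Carlo estimates in (23) and (24), given the per-dataset p-values and an indicator of whether the tested clusters coincide with true clusters (function name ours):

```python
import numpy as np

def power_estimates(p_values, recovered, alpha=0.05):
    """Monte Carlo versions of (23) and (24): `recovered[m]` indicates whether the
    tested clusters in dataset m coincide with true clusters."""
    recovered = np.asarray(recovered, dtype=bool)
    detection_probability = recovered.mean()                     # (24)
    reject = np.asarray(p_values) < alpha
    conditional_power = (reject[recovered].mean()                # (23)
                         if recovered.any() else np.nan)
    return conditional_power, detection_probability
```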
As an alternative to the conditional power in (23), in Appendix A.8, we consider a notion of power that does not condition on having correctly estimated the true clusters.
6. Real data applications

6.1. Palmer penguins (Horst et al., 2020)

Here we analyze the Palmer penguins dataset from the palmerpenguins package in R (Horst et al., 2020). We consider the 165 female penguins with complete observations, and apply k-means clustering with $K = 4$ to two of the collected features: bill depth and flipper length. Figure 6 displays the estimated clusters. We assess the equality of the means of each pair of estimated clusters using $p_{\text{Naive}}$ in (4) and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ in (20) with $\hat\sigma_{\text{MED}}$ defined in (21). The results are in Figure 6. The naive p-values are small for all pairs of estimated clusters, even when the underlying species distributions are nearly identical (e.g., both clusters 1 and 4 are a mix of Chinstrap and Adelie penguins). By contrast, our proposal results in large p-values when testing for a difference in means between clusters composed of the same species (clusters 1 and 4, clusters 3 and 4), and small p-values when the clusters correspond to different species (e.g., clusters 1 and 2, clusters 2 and 3).

6.2. MNIST dataset (Lecun et al., 1998)

In this section, we apply our method to the MNIST dataset (Lecun et al., 1998), which consists of 60,000 gray-scale images of handwritten digits. Each image has an accompanying label in $\{0, 1, \ldots, 9\}$, and is stored as a $28 \times 28$ matrix that takes on values in $[0, 255]$. We first divide the entries of all the images by 255. Next, since there is no variation in the peripheral pixels of the images (Gallaugher & McNicholas, 2018), which violates model (1), we add an independent $N(0, 0.01)$ perturbation to each element of the image. Finally, we vectorize each image to obtain a vector $x_i \in \mathbb{R}^{784}$. We first construct a "no cluster" dataset by randomly sampling 1,500 images of the digit 0; thus, $n = 1{,}500$ and $q = 784$. To de-correlate the pixels in each image, we whitened the data (see Section 4.1) using $\hat\Sigma^{-\frac{1}{2}} = U(\Lambda + 0.01 I)^{-\frac{1}{2}} U^\top$ as in prior work (Coates & Ng, 2012), where $U \Lambda U^\top$ is the eigenvalue decomposition of the sample covariance matrix.
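A sketch of this pre-processing and regularized whitening step follows; the array `images` is a hypothetical stand-in for the sampled MNIST digits, and the 0.01 in the noise is taken as a variance here:

```python
import numpy as np

rng = np.random.default_rng(0)
# `images` is a hypothetical stand-in for the 1,500 sampled images of the digit 0.
images = rng.integers(0, 256, size=(1500, 28, 28)).astype(float)

x = images.reshape(len(images), -1) / 255.0    # rescale to [0, 1] and vectorize
x += rng.normal(scale=0.1, size=x.shape)       # N(0, 0.01) noise (0.01 = variance)
# Regularized whitening, as in Coates & Ng (2012):
S = np.cov(x, rowvar=False)                    # 784 x 784 sample covariance
evals, U = np.linalg.eigh(S)
x_white = x @ (U @ np.diag(1.0 / np.sqrt(evals + 0.01)) @ U.T)
```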

We apply k-means clustering with $K = 6$. The centroids are displayed in the top left panel of Figure 7. For each pair of estimated clusters, we compute the p-values $p_{\text{Naive}}$ and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ (see Figure 7). The naive p-values are extremely small for all pairs of clusters under consideration, despite the resemblance of the centroids. By contrast, our approach yields modest p-values, congruent with the visual resemblance of the centroids. In addition, for the most part, the pairs for which $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ is small are visually quite different (e.g., clusters 1 and 2, clusters 1 and 5, and clusters 4 and 5).
To demonstrate the power of the test based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, we also generated a "cluster" dataset by sampling 500 images each of the digits 0, 1, 3, and 8; thus, $n = 2{,}000$ and $q = 784$. We again whitened the data to obtain uncorrelated features. After applying k-means clustering with $K = 4$, we obtain four clusters that roughly correspond to the four digits: cluster 1, 94.0% digit 1; cluster 2, 72.4% digit 3; cluster 3, 83.6% digit 0; cluster 4, 62.4% digit 8 (see the bottom left panel of Figure 7). Results from testing for a difference in means for each pair of clusters using $p_{\text{Naive}}$ and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ are in Figure 7. Both sets of p-values are small on this "cluster" dataset.
6.3. Single-cell RNA-sequencing data (Zheng et al., 2017)

In this section, we apply our proposal to single-cell RNA-sequencing data collected by Zheng et al. (2017). Single-cell RNA-sequencing quantifies gene expression abundance at the resolution of single cells, thereby revealing cell-to-cell heterogeneity in transcription and allowing for the identification of cell types and marker genes. In practice, biologists often cluster the cells to identify putative cell types, and then perform a differential expression analysis, i.e., they test for a difference in gene expression between two clusters (Stuart et al., 2019; Lähnemann et al., 2020; Grün et al., 2015). Because this approach ignores the fact that the clusters were estimated from the same data used for testing, it does not control the selective Type I error. Zheng et al. (2017) profiled 68,000 peripheral blood mononuclear cells, and classified them based on their match to the expression profiles of 11 reference transcriptomes from known cell types. We consider the classified cell types to be the "ground truth", and use this information to demonstrate that our proposal in Section 2 yields reasonable results.
As in prior work (Gao et al., 2020; Duò et al., 2018), we first excluded cells with low numbers of expressed genes or total counts, as well as cells in which a large percentage of the expressed genes are mitochondrial. We then divided the counts for each cell by the total sum of counts in that cell. Finally, we applied a $\log_2$ transformation with a pseudo-count of 1 to the expression data, and considered only the subset of 500 genes with the largest average expression levels pre-normalization.
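A sketch of this pre-processing pipeline, applied to a cells-by-genes count matrix (quality-control filtering omitted; function name ours):

```python
import numpy as np

def preprocess_counts(counts, n_genes=500):
    """Normalize each cell by its total count, log2-transform with pseudo-count 1,
    and keep the genes with the largest average pre-normalization expression."""
    normalized = counts / counts.sum(axis=1, keepdims=True)
    logged = np.log2(normalized + 1.0)
    top = np.argsort(counts.mean(axis=0))[::-1][:n_genes]
    return logged[:, top]
```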
We applied the aforementioned pre-processing pipeline separately to memory T cells ($N = 10{,}224$) and to a mixture of five types of cells (memory T cells, B cells, naive T cells, natural killer cells, and monocytes; $N = 43{,}259$).
To investigate the selective Type I error in the absence of true clusters, we first constructed a "no cluster" dataset by randomly sampling 1,000 of the 10,224 memory T cells after pre-processing (thus, $n = 1{,}000$ and $q = 500$). Since the gene expression levels are highly correlated, we first whitened the data as described in Section 4.1, plugging in $\hat\Sigma^{-\frac{1}{2}} = U(\Lambda + 0.01 I)^{-\frac{1}{2}} U^\top$ (Coates & Ng, 2012), where $U \Lambda U^\top$ is the eigenvalue decomposition of the sample covariance matrix.
We applied k-means clustering to the transformed data with $K = 5$, and obtained five clusters consisting of 97, 223, 172, 165, and 343 cells, respectively (see the left panel of Figure 10 in Appendix A.9). For each pair of estimated clusters, we computed the p-values $p_{\text{Naive}}$ and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$. The results are displayed in the top panel of Table 1. On this dataset, the naive p-values are extremely small for all pairs of estimated clusters, while our proposed p-values are quite large. In particular, at $\alpha = 0.05$, the test based on $p_{\text{Naive}}$ concludes that all five estimated clusters correspond to distinct cell types (even after multiplicity correction), whereas our approach does not reject the null hypothesis that the expression levels (and thus cell types) are, in fact, the same between estimated clusters. Because this "no cluster" dataset consists only of memory T cells, we believe that the conclusion based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ aligns better with the underlying biology.

Discussion
We have proposed a test for a difference in means between two clusters estimated via k-means clustering, under model (1). Here, we outline several future research directions.
While the p-value in (9) leads to selective Type I error control, it conditions on more information than is used to construct the hypothesis in (2). In practice, data analysts likely only make use of the final cluster assignments (leading to the p-value in (8)), as opposed to all of the intermediate assignments (leading to the p-value in (9)). Empirically, conditioning on too much information results in a loss of power (Fithian et al., 2014; Jewell et al., 2022; Liu et al., 2018). In future work, we will investigate the possibility of leveraging recent developments in selective inference (Chen et al., 2021a; Le Duy & Takeuchi, 2021; Jewell et al., 2022) to compute the "ideal" p-value (8).
We could also consider extending our proposal to other data-generating models. The normality assumption in (1) is critical to the proof of Proposition 1, because it guarantees that under $H_0$ in (2), $\|X^\top \nu\|_2$, $\text{dir}(X^\top \nu)$, and $\Pi_\nu^\perp X$ are pairwise independent. However, this normality assumption is often violated in practice; for instance, in single-cell genomics, the data are count-valued and the variance of a gene's expression levels varies drastically with its mean expression level (Stuart et al., 2019; Eling et al., 2018). This has motivated some authors to work with alternative models for gene expression, including the Poisson (Witten, 2011), negative binomial (Risso et al., 2018), and curved normal (Lin et al., 2021) models. To extend our framework to other exponential family distributions, we may be able to leverage recent proposals to decompose $X$ into $f(X)$ and $g(X)$ such that both $f(X)$ and $g(X) \mid f(X)$ have a known, computationally tractable distribution (Rasines & Alastair Young, 2021; Leiner et al., 2021).
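As a concrete illustration of such a decomposition (a hedged sketch of Poisson thinning in the spirit of Leiner et al. (2021), not a method developed in this paper): if $X \sim \text{Poisson}(\lambda)$ and $f(X) \mid X \sim \text{Binomial}(X, \varepsilon)$, then $f(X) \sim \text{Poisson}(\varepsilon\lambda)$ and $g(X) = X - f(X) \sim \text{Poisson}((1-\varepsilon)\lambda)$ are independent.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, eps = 5.0, 0.5
x = rng.poisson(lam, size=100_000)
f_x = rng.binomial(x, eps)     # f(X) | X ~ Binomial(X, eps)
g_x = x - f_x                  # g(X) = X - f(X)
# Marginally f(X) ~ Poisson(eps * lam) and g(X) ~ Poisson((1 - eps) * lam),
# and the two are independent: one piece can be used to cluster, the other to test.
print(f_x.mean(), g_x.mean(), np.corrcoef(f_x, g_x)[0, 1])  # ~2.5, ~2.5, ~0
```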

Acknowledgments
This work was partially supported by National Institutes of Health grants and a Simons Investigator Award to Daniela Witten.
Finally, we have that $\text{pr}_{H_0}\left(p_{\text{selective}} \leq \alpha \mid H_0 \text{ is tested}\right) \leq \alpha$, where step a. in the derivation follows from the tower property of conditional expectation, and step b. is a direct consequence of (A.4). Therefore, we conclude that the test based on $p_{\text{selective}}$ controls the selective Type I error in (5), which completes the proof of Proposition 1.

A.2. Proof of Proposition 2
We will derive the expression for $S_T$ in Proposition 2 using an induction argument. The following two claims (Lemmas 4 and 5) serve as the "base cases" for the proof.
Lemma 4. Recall that $c_i^{(t)}(x)$ denotes the cluster to which the $i$th observation is assigned during the $t$th iteration of step 3b of Algorithm 1 applied to data $x$, and that $m_k^{(0)}(x)$ denotes the $k$th centroid sampled from $x$ during step 1 of Algorithm 1. For $S_0$ defined as
$$S_0 = \left\{\phi \geq 0 : c_i^{(0)}(x'(\phi)) = c_i^{(0)}(x), \; i = 1, \ldots, n\right\}, \qquad (A.5)$$
we have that
$$S_0 = \bigcap_{i=1}^{n} \bigcap_{k=1}^{K} \left\{\phi \geq 0 : \left\|[x'(\phi)]_i - m_{c_i^{(0)}(x)}^{(0)}(x'(\phi))\right\|_2^2 \leq \left\|[x'(\phi)]_i - m_k^{(0)}(x'(\phi))\right\|_2^2\right\}. \qquad (A.6)$$

Proof. We first prove that the set in (A.5) is a subset of the set in (A.6). For an arbitrary $\phi_0 \in$ (A.5) and $1 \leq i \leq n$, the inequalities in (A.6) hold: the first line of the derivation follows from the definition of $c_i^{(0)}$ in step 2 of Algorithm 1, step a. follows from the definition of the argmin function, and step b. follows from the assumption that $\phi_0 \in$ (A.5), which implies $c_i^{(0)}(x'(\phi_0)) = c_i^{(0)}(x)$. Because this holds for an arbitrary $1 \leq i \leq n$, we have proven that $\phi_0 \in$ (A.5) $\implies \phi_0 \in$ (A.6); or equivalently, (A.5) $\subseteq$ (A.6).
We proceed to prove the other direction. For an arbitrary $\phi_0 \in$ (A.6) and an arbitrary $1 \leq i \leq n$, we have that $c_i^{(0)}(x'(\phi_0)) = c_i^{(0)}(x)$. Here, step a. follows from the definition of argmin, and step b. follows from the definition of $c_i^{(0)}(x'(\phi))$ in step 2 of Algorithm 1. We conclude that $\phi_0 \in$ (A.6) $\implies \phi_0 \in$ (A.5). Combining these two directions, we have proven that (A.6) = (A.5).
Lemma 5. Recall that $c_i^{(t)}(x)$ denotes the cluster to which the $i$th observation is assigned in the $t$th iteration of step 3b of Algorithm 1 applied to data $x$, and that $m_k^{(0)}(x)$ denotes the $k$th centroid sampled from $x$ during step 1 of Algorithm 1. For $S_1$ defined as
$$S_1 = \left\{\phi \geq 0 : c_i^{(t)}(x'(\phi)) = c_i^{(t)}(x), \; i = 1, \ldots, n, \; t = 0, 1\right\}, \qquad (A.7)$$
with $w_i^{(t)}(k)$ defined in (13), we have that
$$S_1 = S_0 \cap \bigcap_{i=1}^{n} \bigcap_{k=1}^{K} \left\{\phi \geq 0 : \left\|[x'(\phi)]_i - (x'(\phi))^\top w^{(0)}(c_i^{(1)}(x))\right\|_2^2 \leq \left\|[x'(\phi)]_i - (x'(\phi))^\top w^{(0)}(k)\right\|_2^2\right\}. \qquad (A.8)$$

Proof. We first prove that (A.7) $\subseteq$ (A.8). For an arbitrary $\phi_0 \in$ (A.7) and an arbitrary $1 \leq i \leq n$, the inequalities in (A.8) hold. In the derivation, the first line follows from step 3b of Algorithm 1 with $t = 0$. Next, step a. follows from the definition of (A.7), which implies that $c_i^{(1)}(x'(\phi_0)) = c_i^{(1)}(x)$. Step b. is a direct consequence of step 3a of Algorithm 1 with $t = 0$. In steps c. and d., we used the definitions of the argmin function and (A.7). Finally, we apply the definition of $w_i^{(t)}$ in (13) to obtain step e. Because this holds for an arbitrary $1 \leq i \leq n$, $\phi_0 \in$ (A.7) implies that $\phi_0$ is an element of the second set in the intersection in (A.8).
Next, we prove that the set in (A.8) is a subset of the set in (A.7). For an arbitrary $\phi_0 \in$ (A.8) and an arbitrary $1 \leq i \leq n$, an analogous argument yields $c_i^{(1)}(x'(\phi_0)) = c_i^{(1)}(x)$, so that $\phi_0 \in$ (A.7). This completes the proof.
Next, we will prove the inductive step in the proof of Proposition 2, which relies on the following claim.
Lemma 6. Recall that $c_i^{(t)}(x)$ denotes the cluster to which the $i$th observation is assigned in the $t$th iteration of Algorithm 1 applied to the data $x$, and that $m_k^{(0)}(x)$ denotes the $k$th centroid sampled from $x$ during initialization. For some $1 \leq T' \leq T - 1$, define $S_{T'}$ as in (A.9). Suppose that the representation (A.10) holds for $T'$, where $w_i^{(t)}(\cdot)$ is defined in (13). Then, for $S_{T'+1}$ defined as in (A.11), the representation (A.12) holds.

Proof. Using the definitions in (A.9) and (A.11), it suffices to prove that (A.13) = (A.12), under the inductive hypothesis (A.10).
Here, the first statement follows from the definition of $S_{T'+1}$. Next, steps a. and b. follow from the definitions of $c_i^{(T'+1)}$ and $m_k^{(T'+1)}(x'(\phi_0))$ in steps 3b and 3a of Algorithm 1, respectively. In step c., we used the fact that $\phi_0 \in$ (A.13) $\implies \phi_0 \in S_{T'} \implies c_i^{(T')}(x'(\phi_0)) = c_i^{(T')}(x)$. Finally, d. follows from the definition of $w_i^{(t)}$ in (13). We continue with the reverse direction. Applying the inductive hypothesis (A.10), together with the definition of $S_{T'+1}$ in (A.12) and the definition of $w_i^{(t)}$ in (13), we obtain (A.14). For an arbitrary $\phi_0 \in$ (A.12) and any $1 \leq i \leq n$, the following holds: to derive step a., we first note that by (A.14), any element $\phi_0$ of (A.12) is also an element of $S_{T'}$; therefore, using the definition of $S_{T'}$ in (A.9), we have that $c_i^{(t)}(x'(\phi_0)) = c_i^{(t)}(x)$ for $t = 1, \ldots, T'$, and step a. follows directly. Next, steps b. and c. follow directly from steps 3a and 3b of Algorithm 1 with $t = T'$. By inspecting the form of (A.13), we conclude that $\phi_0 \in$ (A.12) $\implies \phi_0 \in$ (A.13).
In conclusion, we have proven that (A.12) = (A.13), which completes the proof. The inductive proof of Proposition 2 follows from combining Lemmas 4, 5, and 6.
A.3. Proof of Lemmas 1 and 2

We first prove Lemma 1, which is also Lemma 2 in Gao et al. (2020). Proof. We first express the inner product $\langle [x'(\phi)]_i, [x'(\phi)]_j \rangle$ as a function of $\phi$, starting from (11). Next, using this expression for $\langle [x'(\phi)]_i, [x'(\phi)]_j \rangle$, we obtain that the inner product is quadratic in $\phi$. This completes the proof of Lemma 1. We continue with the proof of Lemma 2, which follows by combining the definition of $w_i^{(t)}(k)$ in (13) with Lemma 1. Therefore, a minor modification of Proposition 2 yields the following corollary.

Corollary 1. Suppose that the k-means clustering algorithm (Algorithm 1) with $K$ clusters, when applied to the data $x$, runs for $T$ steps. Then the set $S_T^\Sigma$ defined in (A.16) admits a representation analogous to that of $S_T$ in Proposition 2.

Here, step a. follows from $\hat{p}_{\text{selective}}(\hat\sigma)$ converging to $p_{\text{selective}}(\sigma)$ in probability, conditional on $A_q$.
Step b. follows from the fact that the result of Proposition 1 applies for any positive integer $q$. This completes the proof of Proposition 5. Proposition 5 assumes that we have a consistent estimator $\hat\sigma$ of $\sigma$. In Appendix A.7, we analyze different estimators of $\sigma$ in (1), and prove that, under appropriate sparsity assumptions on $\mu$ in (1), $\hat\sigma_{\text{MED}}$ in (21) is a consistent estimator of $\sigma$.
As an alternative, we can also use an asymptotically conservative estimator of $\sigma$, as in Gao et al. (2020). This leads to an asymptotically conservative p-value; details are stated in Corollary 2.
Corollary 2. For $q = 1, 2, \ldots$, suppose that $X^{(q)} \sim \mathcal{MN}_{n \times q}(\mu^{(q)}, I_n, \sigma^2 I_q)$. Let $x^{(q)}$ be a realization from $X^{(q)}$ and let $c_i^{(t)}(\cdot)$ be the cluster to which the $i$th observation is assigned during the $t$th iteration of step 3b of Algorithm 1. Consider the sequence of null hypotheses $H_0^{(q)}: (\mu^{(q)})^\top \nu^{(q)} = 0_q$, where $\nu^{(q)}$ defined in (3) is the contrast vector resulting from applying k-means clustering to $x^{(q)}$. Suppose that (i) $\hat\sigma$ is an asymptotically conservative estimator of $\sigma$, i.e., $\lim_{q \to \infty} \text{pr}\left(\hat\sigma(X^{(q)}) \geq \sigma\right) = 1$; and (ii) the regularity condition of Proposition 5 holds for some $\delta \in (0, 1)$. Then, for all $\alpha \in (0, 1)$, we have that
$$\lim_{q \to \infty} \text{pr}_{H_0^{(q)}}\left(\hat{p}_{\text{selective}}(\hat\sigma) \leq \alpha\right) \leq \alpha.$$
We omit the proof of Corollary 2, as it follows directly from combining the proof of Proposition 5 with the fact that $\hat{p}_{\text{selective}}(\sigma)$ is a monotonically increasing function of $\sigma$ (see Lemma 10).
In words, Corollary 3 states that under model (1), the rate of convergence of $\hat\sigma^2_{\text{MED}}$ in mean (and therefore in probability) is $\max\{1/(nq)^{1/2}, s/q\}$. In particular, $\hat\sigma^2_{\text{MED}}$ is a consistent estimator of $\sigma^2$ provided that $s/q \to 0$ as $q \to \infty$.
Next, we investigate the properties of the sample variance estimator $\hat\sigma^2_{\text{Sample}}$.

Proposition 6. Under model (1), for $\hat\sigma^2_{\text{Sample}}(X) = \sum_{i=1}^{n} \sum_{j=1}^{q} (X_{ij} - \bar{X}_j)^2 / (nq - q)$, we have that
$$E\left[\hat\sigma^2_{\text{Sample}}(X)\right] - \sigma^2 = \frac{1}{2n(n-1)q} \sum_{i=1}^{n} \sum_{i'=1}^{n} \sum_{j=1}^{q} (\mu_{ij} - \mu_{i'j})^2. \qquad (A.27)$$
Moreover, for any integers $s$ and $q$ such that $ns \leq q$, the lower bound in (A.28) holds for some constant $c_0$.

Proof. We start with the proof of (A.27). Under (1), the claimed equalities hold, where the last equality follows from Lagrange's identity, which (in the form used here) states that $\sum_{i=1}^{n} (a_i - \bar{a})^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{i'=1}^{n} (a_i - a_{i'})^2$. To prove the second statement, we consider a specific matrix $\tilde\mu \in \mathbb{R}^{n \times q}$ with exactly $ns \leq q$ non-zero entries. In addition, each column of $\tilde\mu$ has at most one non-zero entry and each row of $\tilde\mu$ has exactly $s$ non-zero entries. This is possible because $ns$ is assumed to be no greater than $q$. Finally, we assume that the square of the minimal non-zero entry of $\tilde\mu$, $\min_{i,j: \tilde\mu_{ij} \neq 0} \tilde\mu_{ij}^2$, is bounded below.

A.8. Additional power analyses

We consider model (A.29), in which $q$ is taken to be a multiple of 10 and, for $\delta > 0$, $\theta \in \mathbb{R}^{3 \times 0.1q}$ has orthogonal rows with $\|\theta_i\|_2^2 = \delta/2$ for $i = 1, 2, 3$. As in Section 5.3, we can think of $C_1 = \{1, \ldots, n/3\}$, $C_2 = \{(n/3) + 1, \ldots, 2n/3\}$, and $C_3 = \{(2n/3) + 1, \ldots, n\}$ as "true clusters". Under (A.29), the pairwise distance between each pair of true clusters is $\delta$.
We generate $M = 100{,}000$ datasets from (A.29) with $q = 50$, $\sigma = 0.25, 0.5, 1$, and $\delta = 2, 3, \ldots, 10$. For each simulated dataset, we apply k-means clustering with $K = 3$ and reject $H_0: \mu^\top \nu = 0_q$ if $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, or $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ is less than $\alpha = 0.05$. Figure 9(a) displays the detection probability (24) of k-means clustering as a function of $\delta$ in (A.29). Under model (1), the detection probability increases as a function of $\delta$ in (A.29) across all values of $\sigma$. For a given value of $\delta$, a larger value of $\sigma$ leads to a lower detection probability. Figure 9(b) displays the conditional power (23) for the tests based on $p_{\text{selective}}$, $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$, and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$. For some combinations of $\delta$ and $\sigma$, the conditional power is not displayed, because the true clusters were never recovered in simulation. For all tests and values of $\sigma$ under consideration, conditional power is an increasing function of $\delta$. For a given test and value of $\delta$, a smaller $\sigma$ leads to higher conditional power. Moreover, for the same values of $\delta$ and $\sigma$, the test based on $p_{\text{selective}}$ has the highest conditional power, followed closely by the test based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$. Using $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ leads to a less powerful test, especially for larger values of $\delta$. As a comparison, we included the detection probability and conditional power under model (22) with $q = 50$ in panels (c) and (d) of Figure 9. The tests under consideration behave qualitatively similarly as a function of $\delta$ and $\sigma$. Under (22), we observe an even larger gap between the power of the test based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ and the power of the test based on $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$.

A.9. Additional results for real data applications

In this section, we visualize the estimated clusters for the single-cell RNA-sequencing data in Section 6.3.

Fig. 10: Left: The two-dimensional UMAP embedding (McInnes et al., 2018) of the "no cluster" dataset after preprocessing (as described in Section 6.3), colored by the estimated cluster membership from k-means clustering. Right: Same as left, but for the "cluster" dataset.

Fig. 2: Left: One simulated dataset generated from (1) with $\mu = 0_{100 \times 2}$ and $\sigma = 1$. We apply k-means clustering on the training set to obtain three clusters. Center: We apply the training-set clusters to the test set using a 3-nearest-neighbors classifier. Right: Quantile-quantile plot of the naive p-values (4) applied to the test set, aggregated over 2,000 simulated datasets.

Fig. 3: One simulated dataset generated from model (1) with $\sigma = 1$. Left: The original data $x$ corresponds to $\phi = \|x^\top \nu\|_2 = 4.37$. Applying k-means clustering with $K = 3$ yields three clusters, displayed in pink, blue, and orange. Here, $\nu$ is chosen to test for a difference in means between $\hat{C}_1$ (pink) and $\hat{C}_2$ (blue). Center: The perturbed data $x'(\phi)$ with $\phi = 0$. Applying k-means clustering with $K = 3$ does not yield the same set of clusters as in the left panel. Right: The perturbed data $x'(\phi)$ with $\phi = 6$. Applying k-means clustering with $K = 3$ yields the same set of clusters as in the left panel.

Fig. 7: Top left: Centroids of six clusters from the "no cluster" dataset ($\hat{C}_1$ to $\hat{C}_6$, from left to right and top to bottom). Bottom left: Same as top left, but for the "cluster" dataset. Right: We test the null hypothesis of no difference between each pair of cluster centroids using $p_{\text{Naive}}$ and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$. Here, $\bar\mu_i = \sum_{j \in \hat{C}_i} \mu_j / |\hat{C}_i|$.

Fig. 8: Left: Additional analysis of the data in Section 5.3 with $\sigma = 0.5$. We fit a regression spline to display the power of the tests based on $p_{\text{selective}}$ (green line), $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ (orange line), and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{Sample}})$ (purple line) as a function of $\|\mu^\top \nu\|_2$. Right: Same as left, but for $\sigma = 1$.

Table 1: P-values $p_{\text{Naive}}$ in (4) and $\hat{p}_{\text{selective}}(\hat\sigma_{\text{MED}})$ in (20), for each pair of estimated clusters in the single-cell RNA-sequencing datasets of Section 6.3.